Abstract: The increasing digitalization of information generates huge amounts of data, and the richness of the information in this data has attracted considerable research interest. The major problem with real-time data is that it arrives as a fast, continuous stream, which makes it difficult to analyze. Intrusion detection is a continuous process, and the number of packets to be analyzed varies considerably with the size of the network and the number of transmissions carried out in it. Because packets are transferred at high speed, a mechanism that provides analysis in real time becomes mandatory. This paper presents a tree-based technique that analyzes network traffic and provides real-time predictions with high accuracy. It uses the Random Forest classifier, an ensemble of decision trees. Experiments were conducted on the Hadoop platform using Spark. Spark, which supports stream processing, exhibits effective results in real time.

Keywords: Classification; Anomaly Detection; Network Intrusion Detection; Hadoop; Spark; Random Forest.